Scaling Up Word Clustering

Authors

Jon Dehdari

Liling Tan

Josef van Genabith

Abstract

Word clusters improve performance in many NLP tasks, including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3–85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm’s perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, resulting in shorter training times while achieving the same translation quality measured in BLEU scores. ClusterCat is portable and freely available.

Similar resources

Word clustering effect on vocabulary learning of EFL learners: A case of semantic versus phonological clustering

The aim of this study is to determine the effect of word clustering method on vocabulary learning of Iranian EFL learners through a case of semantic versus phonological clustering. To this effect, 80 homogeneous students from four intermediate classes at an English institute in Torbat e Heydariyeh participated in this research. They were assigned to four groups according to semantic versus phon...

A Synchronic Lexical Study of Gbe Language Varieties: The Effects of Different Similarity Judgment Criteria

In the context of a synchronic lexical study of the Gbe varieties of West Africa, this paper explores the question whether the use of different criteria sets to judge the similarity of lexical features in different language varieties yields the same or different conclusions regarding the relative relationships and clustering of the investigated varieties and the prioritization of further sociol...

Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition

In this paper, utilization of clustering algorithms for data fusion in decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...

Offline Language-free Writer Identification based on Speeded-up Robust Features

This article proposes offline language-free writer identification based on speeded-up robust features (SURF), which goes through training, enrollment, and identification stages. In all stages, an isotropic Box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of word region and the corresponding scales and orientations (SOs) are extr...

Development of Meaning Structure by Usage-based Word Relationships

Development of meaning structure is studied from a usage-based viewpoint by a constructive approach. The meaning structure is represented by relationships between words. A word's relationship to other words, which represents meanings of the word, is derived by analyzing similarity of the word's usage in sentences. Words make clusters according to their similarity. The word clusters are classifie...

Publication date: 2016
